Capturing sequences adjacent to type-IIS restriction sites for genomic library mapping

ABSTRACT

The present invention relates to novel methods for sequencing and mapping genetic markers in polynucleotide sequences using Type-IIs restriction endonucleases. The methods herein described result in the “capturing” and determination of specific oligonucleotide sequences located adjacent to Type-IIs restriction sites. The resulting sequences are useful as effective markers for use in genetic mapping, screening and manipulation.

[0001] This application is a continuation-in-part of U.S. application Ser. No. 08/307,881, filed Sep. 16, 1994, which is hereby incorporated by reference in its entirety for all purposes.

[0002] Research leading to the present invention was funded in part by NIH grant Nos. 5-F32-HG00105 and R01 HG00813-02, and the government may have certain rights to the invention.

[0003] The present invention generally relates to novel methods for isolating, characterizing and mapping genetic markers in polynucleotide sequences. More particularly, the present invention provides methods for mapping genetic material using Type-IIs restriction endonucleases. The methods herein described result in the “capturing” and determination of specific oligonucleotide sequences located adjacent to Type-IIs restriction sites. The resulting sequences are useful as effective markers for use in genetic mapping, screening and manipulation.

BACKGROUND OF THE INVENTION

[0004] The relationship between structure and function of macromolecules is of fundamental importance in the understanding of biological systems. These relationships are important to understanding, for example, the functions of enzymes, structural proteins and signalling proteins, ways in which cells communicate with each other, as well as mechanisms of cellular control and metabolic feedback.

[0005] Genetic information is critical in continuation of life processes. Life is substantially informationally based and its genetic content controls the growth and reproduction of the organism and its complements. The amino acid sequences of polypeptides, which are critical features of all living systems, are encoded by the genetic material of the cell. Further, the properties of these polypeptides, e.g., as enzymes, functional proteins, and structural proteins, are determined by the sequence of amino acids which make them up. As structure and function are integrally related, many biological functions may be explained by elucidating the underlying structural features which provide those functions, and these structures are determined by the underlying genetic information in the form of polynucleotide sequences. Further, in addition to encoding polypeptides, polynucleotide sequences also can be involved in control and regulation of gene expression. It therefore follows that the determination of the make-up of this genetic information has achieved significant scientific importance.

[0006] Physical maps of genomic DNA assist in establishing the relationship between genetic loci and the DNA fragments which carry these loci in a clone library. Physical maps include “hard” maps which are overlapping cloned DNA fragments (“contigs”) ordered as they are found in the genome of origin, and “soft” maps which consist of long range restriction enzyme and cytogenetic maps (Stefton and Goodfellow, 1992). In the latter case, the combination of rare cutting restriction endonucleases (e.g., NotI) and pulsed field gel electrophoresis allows for the large scale mapping of genomic DNAs. These methods provide a low resolution or top down approach to genomic mapping.

[0007] A bottom up approach is exemplified by construction of contiguous or “contig” maps. Initial attempts to construct contig maps for the human genome have been based upon ordering inserts cloned into cosmids. More recent studies have utilized yeast artificial chromosomes (YACs) which allow for cloning larger inserts. The construction of contig maps requires that many clones be examined (4-5 genome equivalents) in order to assure that sufficient overlap between clones is achieved. Currently, four approaches are used to identify overlapping sequences.

[0008] The first method is restriction enzyme fingerprinting. This method involves the electrophoretic sizing of restriction enzyme generated DNA fragments for each clone and establishing a criterion for clone overlap based on the similarity of fragment sets produced for each clone. The sensitivity and specificity of this approach have been improved by labelling of fragments using ligation and end-filling techniques. The detection of repetitive sequence elements (e.g., [GT]n) has also been employed to provide characteristic markers.

[0009] The second method generally employed in mapping applications is the binary scoring method. This method involves the immobilization of members of a clone library to filters and hybridization with sets of oligonucleotide probes. Several mathematical models have been developed to avoid the need for large numbers of the probe sets which are designed to detect the overlap regions.

[0010] A third method is the Sequence Tagged Site (“STS”) method. This method employs PCR techniques and gel analysis to generate DNA products whose lengths characterize them as being related to common regions of sequence that are present in overlapping clones. The sequence of the primer pairs and the characteristic distance between them provide sufficient information to establish a single copy landmark (SCL) which is analogous to single copy probes that are unique in the entire genome.

[0011] A fourth method uses cross-hybridizing libraries. This method involves the immobilization of two or more pools of cosmid libraries followed by cross-hybridization experiments between pairs of the libraries. This cross-hybridization demonstrates shared cloned sequences between the library pairs. See, e.g., Kupfer, et al., (1995) Genomics 27:90-100.

[0012] Although each of these methods is capable of generating useful physical maps of genomic DNA, they each involve complex series of reaction steps including multiple independent synthesis, labelling and detection procedures.

[0013] Traditional restriction endonuclease mapping techniques, i.e., as described above, typically utilize restriction enzyme recognition/cleavage sites as genetic markers. These methods generally employ Type-II restriction endonucleases, e.g., EcoRI, HindIII and BamHI, which will typically recognize specific palindromic nucleotide sequences, or restriction sites, within the polynucleotide sequence to be mapped, and cleave the sequence at that site. The restriction fragments which result from the cleavage of separate fragments of the polynucleotide (i.e., from a prior digestion) are then separated by size. Overlap is shown where restriction fragments of the same size appear from Type-II endonuclease digestion of separate polynucleotide fragments.

[0014] Type-IIs endonucleases, on the other hand, generally recognize non-palindromic sequences. Further, these endonucleases generally cleave outside of their recognition site, thus producing overhangs of ambiguous base pairs. Szybalski, 1985, Gene 40:169-173. Additionally, as a result of their non-palindromic recognition sequences, the use of Type-IIs endonucleases will generate more markers per kb than a similar Type-II endonuclease, e.g., approximately twice as many.
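The "approximately twice" figure can be illustrated with a rough calculation. The sketch below assumes, as a simplification not stated in the text, a random sequence with equal base frequencies: a palindromic site reads identically on both strands and so occurs in one orientation only, whereas a non-palindromic site can occur in either orientation. The function name is hypothetical.

```python
def expected_spacing_bp(site: str) -> float:
    """Expected mean distance (bp) between occurrences of a site,
    counting both strands.

    Assumption: random sequence with equal A/C/G/T frequencies.
    """
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    revcomp = "".join(comp[b] for b in reversed(site))
    # A palindromic site is its own reverse complement: one orientation.
    orientations = 1 if revcomp == site else 2
    return 4 ** len(site) / orientations


print(expected_spacing_bp("GAATTC"))  # EcoRI, palindromic
print(expected_spacing_bp("CTCTTC"))  # EarI, non-palindromic
```

Under these assumptions the six-base EarI site is expected about once every 2048 bp, against once every 4096 bp for the six-base EcoRI site.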

[0015] The use of Type-IIs endonucleases in mapping genomic markers has been described in, e.g., Brenner, et al., P.N.A.S. 86:8902-8906 (1989). The methods described involved cleavage of genomic DNA with a Type-IIs endonuclease, followed by polymerization with a mixture of the four deoxynucleotides as well as one of the four specific fluorescently labelled dideoxynucleotides (ddA, ddT, ddG or ddC). Each successive unpaired nucleotide within the overhang of the Type-IIs cleavage site would be filled by either a normal nucleotide or the labelled dideoxynucleotide. Where the latter occurred, polymerization stopped. Thus, the polymerization reaction yields an array of double stranded fluorescent DNA fragments of slightly different sizes. By reading the sizes from smallest to largest, in each of the nucleotide groups, one can determine the specific sequence of the overhang. However, this method can be time consuming and yields only the sequence of the overhang region.

[0016] Oligonucleotide probes have long been used to detect complementary nucleic acid sequences in a nucleic acid of interest (the “target” nucleic acid). In some assay formats, the oligonucleotide probe is tethered, i.e., by covalent attachment, to a solid support, and arrays of oligonucleotide probes immobilized on solid supports have been used to detect specific nucleic acid sequences in a target nucleic acid. See, e.g., U.S. patent application Ser. No. 08/082,937, filed Jun. 25, 1993, which is incorporated herein by reference. Others have proposed the use of large numbers of oligonucleotide probes to provide the complete nucleic acid sequence of a target nucleic acid but failed to provide an enabling method for using arrays of immobilized probes for this purpose. See U.S. Pat. Nos. 5,202,231 and 5,002,867.

[0017] The development of VLSIPS™ (Very Large Substrate Immobilized Polymer Synthesis) technology has provided methods for making very large combinations of oligonucleotide probes in very small arrays. See U.S. Pat. No. 5,143,854 and PCT patent publication Nos. WO 90/15070 and 92/10092, each of which is incorporated herein by reference in its entirety for all purposes. U.S. patent application Ser. No. 08/082,937, incorporated above, also describes methods for making arrays of oligonucleotide probes that can be used to provide the complete sequence of a target nucleic acid and to detect the presence of a nucleic acid containing a specific nucleotide sequence.

[0018] The construction of genetic linkage maps and the development of physical maps are essential steps on the pathway to determining the complete nucleotide sequence of the human or other genomes. Present methods used to construct these maps rely upon information obtained from a range of technologies including gel-based electrophoresis, hybridization, polymerase chain reaction (PCR) and chromosome banding. These methods, while providing useful mapping information, are very time consuming when applied to very large genome fragments or other nucleic acids. There is therefore a need to provide improved methods for the identification and correlation of genetic markers on a nucleic acid which can be used to rapidly generate genomic maps. The present invention meets these and other needs.

SUMMARY OF THE INVENTION

[0019] The present invention provides methods for identifying specific oligonucleotide sequences using Type-IIs endonucleases in sequential order to capture the ambiguous sequences adjacent to the Type-IIs recognition sites. These ambiguous sequences can then be probed sequentially with probes specific for the various combinations of possible ambiguous base pair sequences. By determining which probe hybridizes with an ambiguous sequence, that sequence is thus determined. Further, because that sequence is adjacent to a specific Type-IIs cleavage site, that portion of the sequence is also known. This contiguous sequence is useful as a marker sequence in mapping genomic libraries.

[0020] In one embodiment, the present invention provides a method of identifying sequences in a polynucleotide sequence. The method comprises cleaving the polynucleotide sequence with a first type-IIs endonuclease. A first adapter sequence, having a recognition site for a second type-IIs endonuclease, is ligated to the polynucleotide sequence cleaved in the first cleaving step. The polynucleotide sequence resulting from the first ligating step is cleaved with the second type-IIs endonuclease, and a second adapter sequence is ligated to the polynucleotide sequence cleaved in the second cleaving step. The sequence of nucleotides of the polynucleotide sequence between the first and second adapter sequences is then determined.

[0021] In another embodiment, the present invention provides a method of generating an ordered map of a library of genomic fragments. The method comprises identifying sequences in each of the genomic fragments in the library, as described above. The identified sequences in each fragment are compared with the sequences identified in each other fragment to obtain a level of correlation between each fragment and each other fragment. The fragments are then ordered according to their level of correlation.
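The pairwise comparison step can be sketched as follows. This is only an illustration under stated assumptions: each fragment is represented by the set of captured marker sequences detected in it, and the level of correlation is taken to be a normalized inner product of the corresponding presence/absence vectors. The scoring formula, the function names and the toy marker sets are hypothetical and are not taken from the specification.

```python
from itertools import combinations
from math import sqrt


def correlation(a: set, b: set) -> float:
    """Normalized inner product of two presence/absence marker vectors.

    For binary vectors the inner product equals the number of shared markers.
    """
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))


def pairwise_scores(fragments: dict) -> list:
    """Return (score, name1, name2) for every fragment pair, best first."""
    pairs = combinations(sorted(fragments), 2)
    scored = [(correlation(fragments[x], fragments[y]), x, y) for x, y in pairs]
    return sorted(scored, reverse=True)


# Toy data (hypothetical): markers captured from three library fragments.
fragments = {
    "f1": {"ACGT", "TTGA", "CCAT"},
    "f2": {"TTGA", "CCAT", "GGTA"},
    "f3": {"GGTA", "AACC"},
}
for score, x, y in pairwise_scores(fragments):
    print(f"{x}-{y}: {score:.3f}")
```

In this toy case f1 and f2 score highest and f1 and f3 share no markers, which is consistent with an ordering f1, f2, f3.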

[0022] In a further embodiment, the present invention provides a method of identifying polymorphisms in a target polynucleotide sequence. The method comprises identifying sequences in a wild-type polynucleotide sequence, according to the methods described above. The identifying step is repeated on the target polynucleotide sequence. The differences in the sequences identified in each of the identifying steps are determined, the differences being indicative of a polymorphism.

[0023] In still another embodiment, the present invention provides a method of identifying a source of a biological sample. The method comprises identifying a plurality of sequences in a polynucleotide sequence derived from the sample, according to the methods described herein. The plurality of sequences identified in the identifying step are compared with a plurality of sequences identically identified from a polynucleotide derived from a known source. The identity of the plurality of sequences identified from the sample with the plurality of sequences identified from the known source is indicative that the sample was derived from the known source.

[0024] In an additional embodiment, the present invention provides a method of determining a relative location of a target nucleotide sequence on a polynucleotide. The method comprises generating an ordered map of the polynucleotide according to the methods described herein. The polynucleotide is fragmented. The fragment which includes the target nucleotide sequence is then determined, and a marker on the fragment is correlated with a marker on the ordered map to identify the approximate location of the target nucleotide sequence on the polynucleotide.

BRIEF DESCRIPTION OF THE DRAWINGS

[0025] FIG. 1 shows examples of combinations of Type-IIs endonucleases useful in the present invention. Gaps in the sequence illustrate the cleavage pattern of the first Type-IIs endonuclease, shown to the left, whereas arrows illustrate the cleavage points of the second Type-IIs endonuclease, shown to the right, when the recognition site for that endonuclease is ligated to the first cleaved sequence. FIG. 1 also shows the expected frequency of cleavage of the first Type-IIs endonuclease, the number of recognition sites in λ DNA, and the size of the sandwiched sequence.

[0026] FIG. 2 shows a schematic representation of an embodiment of the present invention for capturing Type-IIs restriction sites showing (1) a first cleavage with EarI, (2) ligation of a first adapter sequence to the 5′ overhang, (3) cleavage with HgaI, (4) ligation to a second adapter sequence, and (5) PCR amplification.

[0027] FIG. 3 shows a schematic representation of a preferred embodiment of the present invention using (1) a first cleavage with EarI followed by DNA polymerization of the overhang to yield a blunt end, (2) ligation to a blunt end first adapter sequence, (3) melting off the unligated adapter strand followed by DNA polymerization to extend dsDNA across the first adapter strand, (4) cleavage with HgaI at the EarI recognition site, (5) ligation of a second adapter sequence to the target sequence, and (6) amplification/transcription of the captured target sequence.

[0028] FIG. 4 shows the combinatorial design for an oligonucleotide array used to probe a four nucleotide captured ambiguous sequence. The probes upon the array are 15mers having the sequence 3′-C-T-G-C-G-w-x-y-z-C-T-T-C-T-C-5′, where -w-x-y-z- are determined by the probe's position on the array. For example, the probe indicated by the darkened square on the array shown will have the w-x-y-z sequence of -A-T-G-C-.
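The combinatorial probe set described for FIG. 4 can be enumerated as in the following sketch. It only generates the 4⁴ = 256 probe sequences from the stated 15mer design; the mapping of each w-x-y-z combination to a physical array position is not given here and is not modeled. The constant and function names are hypothetical.

```python
from itertools import product

# Fixed flanks of the 15mer probe, written 3'->5' as in the text:
# 3'-C-T-G-C-G-w-x-y-z-C-T-T-C-T-C-5'
FLANK_3P = "CTGCG"
FLANK_5P = "CTTCTC"


def array_probes() -> dict:
    """Map each w-x-y-z tetramer to its full probe, written 3'->5'."""
    probes = {}
    for combo in product("ACGT", repeat=4):
        wxyz = "".join(combo)
        probes[wxyz] = FLANK_3P + wxyz + FLANK_5P
    return probes


probes = array_probes()
example = probes["ATGC"]
print(len(probes))    # number of distinct probes
print(example)        # written 3'->5'
print(example[::-1])  # the same probe written 5'->3'
```

Written 5′ to 3′, each probe reads C-T-C-T-T-C followed by four variable bases and G-C-G-T-C, that is, the EarI recognition site, the captured positions and the complement of the HgaI site.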

[0029] FIG. 5 shows the predicted and actual fluorescent hybridization pattern of captured sequences from λ DNA as described in Example 1 upon an oligonucleotide array probe having the combinatorial design of FIG. 4. Panel A shows the predicted hybridization pattern where the darkened squares indicate expected marker/probe hybridizations from captured sequences from λ DNA cut with EarI and captured with HgaI-bearing adapter sequences. The actual fluorescence of the hybridization is shown in panel B.

[0030] FIG. 6 shows a portion of a known map of a yeast chromosomal library, illustrating the positions of each fragment of the library within yeast chromosome IV.

[0031] FIG. 7A shows a plot of correlation coefficient scores among hybridization patterns of yeast chromosomal fragments when using Type-IIs and adjacent sequences as markers. FIG. 7B shows the predicted “correlation” scores for EarI captured marker sequences for fifty simulated sequences from yeast chromosome III. The inner product scores for pair-wise comparison of the sequences are plotted versus the percent overlap of the sequences. FIG. 7C shows the same simulated correlation using BbsI captured marker sequences. FIG. 7D shows a simulated correlation using HphI captured marker sequences.

[0032] FIGS. 8A, 8B and 8C show a schematic representation of the identification of polymorphic markers, using the methods of the present invention.

DESCRIPTION OF THE PREFERRED EMBODIMENT

[0033] In general, the present invention provides novel methods for identifying and characterizing sequence based nucleic acid markers as well as a method for determining their presence. The methods may generally be used for generating maps for large, high molecular weight nucleic acids, i.e., for mapping short clones, cosmids, YACs, as well as in methods for genetic mapping for entire genomes. Generally, the methods of the present invention involve the capturing of ambiguous nucleic acid sequence segments using sequential cleavage with restriction endonucleases. In particular, the methods of the invention include a first cleavage which leaves ambiguous sequences downstream from the recognition site of the cleavage enzyme. A second type-IIs recognition site is ligated to the target sequence, and a second cleavage, recognizing the second site, cleaves upstream from the first cleavage site, within the first recognition site, resulting in short sequences which contain the recognition site and an ambiguous sequence “captured” from the target sequence, between the two cleavage sites. The combination of the recognition site and the captured sequences are particularly useful as genetic markers for genomic mapping applications.

[0034] In one embodiment, the methods of the present invention comprise the use of type-IIs endonucleases to capture sequences adjacent to the type-IIs recognition site. These captured sequences then become effective sequence based markers. More particularly, this method comprises first treating the polynucleotide sequence with a first Type-IIs endonuclease having a specific recognition site on the sequence, thereby cleaving the sequence. A first “adapter sequence” which comprises a second Type-IIs endonuclease recognition site is ligated to the cleaved sequence. The resulting heterologous sequence thus has an ambiguous sequence sandwiched between two different Type-IIs endonuclease recognition sites. This resulting sequence is then treated with a second Type-IIs endonuclease specific for the ligated recognition site, thereby cleaving the sequence. A second adapter sequence is then ligated to this cleaved sequence. The sequence resulting from this ligation is then probed to determine the sequence of the sandwiched, or “captured”, ambiguous sequence.

[0035] I. Type-IIs Endonucleases

[0036] Type-IIs endonucleases are generally commercially available and are well known in the art. Like their Type-II counterparts, Type-IIs endonucleases recognize specific sequences of nucleotide base pairs within a double stranded polynucleotide sequence. Upon recognizing that sequence, the endonuclease will cleave the polynucleotide sequence, generally leaving an overhang of one strand of the sequence, or “sticky end.”

[0037] Type-II endonucleases, however, generally require that the specific recognition site be palindromic. That is, reading in the 5′ to 3′ direction, the base pair sequence is the same for both strands of the recognition site. For example, the sequence

     ↓
5′ -G-A-A-T-T-C- 3′
3′ -C-T-T-A-A-G- 5′
             ↑

[0038] is the recognition site for the Type-II endonuclease EcoRI, where the arrows indicate the cleavage sites in each strand. This sequence is palindromic in that both strands of the sequence, when read in the 5′ to 3′ direction, are the same.

[0039] The Type-IIs endonucleases, on the other hand, generally do not require palindromic recognition sequences. Additionally, these Type-IIs endonucleases also generally cleave outside of their recognition sites. For example, the Type-IIs endonuclease EarI recognizes and cleaves in the following manner:

              ↓
-C-T-C-T-T-C-N-N-N-N-N- (SEQ ID NO: 2)
-G-A-G-A-A-G-n-n-n-n-n-
                    ↑

[0040] where the recognition sequence is -C-T-C-T-T-C-, N and n represent complementary, ambiguous base pairs and the arrows indicate the cleavage sites in each strand. As the example illustrates, the recognition sequence is non-palindromic, and the cleavage occurs outside of that recognition site. Because the cleavage occurs within an ambiguous portion of the polynucleotide sequence, it permits the capturing of the ambiguous sequence up to the cleavage site, under the methods of the present invention.
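The EarI cleavage geometry described above can be illustrated with a short sketch. It assumes the cut offsets implied by the example, namely one base past the recognition site on the strand shown on top and four bases past it on the complementary strand; the function name and the test sequence are hypothetical.

```python
def cut_positions(seq: str, site: str, top_offset: int, bottom_offset: int):
    """Locate the first occurrence of `site` in the top strand and return
    (top_cut, bottom_cut) as top-strand indices.

    A cut at index i falls between seq[i - 1] and seq[i]. Returns None
    when the site is absent.
    """
    start = seq.find(site)
    if start < 0:
        return None
    end = start + len(site)
    return end + top_offset, end + bottom_offset


# Hypothetical test sequence containing one EarI site (CTCTTC).
seq = "AAGGCTCTTCACGTTGCA"
top_cut, bottom_cut = cut_positions(seq, "CTCTTC", 1, 4)

site_end = seq.find("CTCTTC") + len("CTCTTC")
ambiguous = seq[site_end:bottom_cut]    # bases between the site and the far cut
single_stranded = seq[top_cut:bottom_cut]  # top-strand positions left unpaired
print(top_cut, bottom_cut, ambiguous, single_stranded)
```

With these offsets the staggered cut spans three positions, and four ambiguous bases lie between the recognition site and the more distant cleavage point.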

[0041] Specific Type-IIs endonucleases which are useful in the present invention include, e.g., EarI, MnlI, PleI, AlwI, BbsI, BsaI, BsmAI, BspMI, Esp3I, HgaI, SapI, SfaNI, BbvI, BsmFI, FokI, BseRI, HphI and MboII. The activity of these Type-IIs endonucleases is illustrated in FIG. 1, which shows the cleavage and recognition patterns of the Type-IIs endonucleases.

[0042] II. Capturing Ambiguous Sequences Adjacent to Type-IIs Restriction Sites

[0043] A general schematic of the capturing of the ambiguous sequencesis shown in FIGS. 2 and 3.

[0044] Treatment of the polynucleotide sequence sought to be mapped with a Type-IIs endonuclease results in a cleaved sequence having a number of ambiguous, or unknown, nucleotides adjacent to a Type-IIs endonuclease recognition site within the target sequence. Additionally, within this ambiguous region, an overhang is created. The recognition site and the ambiguous nucleotides are termed the “target sequence.” The overhang may be 2, 3, 4 or 5 or more nucleotides in length while the ambiguous sequence may be from 4 to 9 or more nucleotides in length, both of which will depend upon the Type-IIs endonucleases used. Examples of specific Type-IIs endonucleases for this first cleavage include BsmAI, EarI, MnlI, PleI, AlwI, BbsI, BsaI, BspMI, Esp3I, HgaI, SapI, SfaNI, BseRI, HphI and MboII. Again, these first Type-IIs endonucleases and their cleavage patterns are shown in FIG. 1, where the shaded region to the left illustrates the recognition site of the first Type-IIs endonuclease, and gaps in the sequence illustrate the cleavage pattern of the enzyme. For example, cleavage of high molecular weight DNA with EarI leaves an overhang of three ambiguous base pairs, as shown in FIGS. 2 and 3, step 1. The recognition site of EarI is indicated by the bar. Thus, EarI cleavage of the target nucleic acid will produce a sequence having the following cleavage end:

-C-T-C-T-T-C-N-         (SEQ ID NO:2)
-G-A-G-A-A-G-n-n-n-n-

[0045] The overhanging bases are then filled in. This is preferably carried out by treatment of the target sequence with a DNA polymerase, such as Klenow fragment or T4 DNA polymerase, resulting in a blunt end sequence as shown in FIG. 3, step 1. Alternatively however, the overhang may be filled by the hybridization of this overhang with an adapter sequence having an overhang complementary to that of the target sequence, as shown in FIG. 2, step 2. A tagging scheme similar to this latter method has been described. See, D. R. Smith, PCR Meth. and Appl. 2:21-27 (1992).

[0046] Following cleavage and fill in of the overhang portion, an adapter sequence is typically ligated to the cleavage end. The adapter sequences described in the present invention generally are specific polynucleotide sequences prepared for ligation to the target sequence. In preferred embodiments, these sequences will incorporate a second type-IIs restriction site. Ligation of an adapter including a HgaI recognition site is shown in FIGS. 2 and 3, step 2. The adapter sequences are generally prepared by oligonucleotide synthesis methods generally well known in the art, such as the phosphoramidite or phosphotriester methods described in, e.g., Gait, Oligonucleotide Synthesis: A Practical Approach, IRL Press (1990).

[0047] An adapter sequence prepared to include a second type-IIs recognition site, for example, the HgaI recognition site 3′-C-G-C-A-G-5′, would be ligated to the cleaved target sequence to provide a cleavage site on the other end of the ambiguous sequence. For example, ligation of the HgaI adapter to the target sequence would produce the following sequence having the cleavage pattern shown:

↓
-C-T-C-T-T-C-N-N-N-N-G-C-G-T-C-
-G-A-G-A-A-G-n-n-n-n-C-G-C-A-G-
          ↑

[0048] In addition to the Type-IIs recognition sites, preferred adapter sequences will also generally include PCR primers and/or promoter sequences for in vitro transcription, thereby facilitating amplification and labeling of the target sequence.

[0049] The method of ligation of the first adapter sequence to the target sequence may be adapted depending upon the particular embodiment practiced. For example, where ligation of the first adapter sequence is to the overhang of the target sequence, as shown in FIG. 2, step 2, the adapter sequence will generally comprise an overhang which is complementary to the overhang of the target sequence. For this embodiment, a mixture of adapter sequences would generally be used wherein all possible permutations of the overhang are present. For example, the number of specific adapter sequences will typically be about 4^m, where m is the number of overhanging nucleotides. For example, where the target sequence after the first cleavage has a 4 base pair overhang of ambiguous nucleotides, the mixture of sequences would typically comprise adapters having upwards of 4^4, or 256, different overhang sequences. Where the overhang in question includes greater numbers of nucleotides, the adapters would generally be provided in two or more separate mixtures to minimize potential ligation of the adapters within each mixture. For example, one set of adapters may incorporate a pyrimidine nucleotide in a given position of the overhang for all adapters in the mixture whereas the other set will have a purine nucleotide in that position. As a result, ligation of the adapters to adapters in the same mixture will be substantially reduced. For longer overhang sequences, it may often be desirable to provide additional separate mixtures of adapters. Ligation of the adapter sequence to the target sequence is then carried out using a DNA ligase according to methods known in the art.
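The adapter mixture described above can be enumerated as in the following sketch, which generates all 4^m overhang sequences and divides them into a purine set and a pyrimidine set at one chosen position. The function names and the choice of position are hypothetical; the sketch only illustrates the bookkeeping and does not model ligation behavior.

```python
from itertools import product

PURINES = set("AG")  # pyrimidines are C and T


def adapter_overhangs(m: int) -> list:
    """All 4**m possible overhang sequences of length m."""
    return ["".join(p) for p in product("ACGT", repeat=m)]


def split_by_base_class(overhangs: list, position: int):
    """Split overhangs into (purine_set, pyrimidine_set) according to the
    base found at `position` (0-based; an arbitrary illustrative choice)."""
    purine = [o for o in overhangs if o[position] in PURINES]
    pyrimidine = [o for o in overhangs if o[position] not in PURINES]
    return purine, pyrimidine


overhangs = adapter_overhangs(4)
purine_mix, pyrimidine_mix = split_by_base_class(overhangs, position=0)
print(len(overhangs), len(purine_mix), len(pyrimidine_mix))
```

For a four-base overhang this gives 256 adapters in total, divided evenly into two mixtures of 128.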

[0050] Where the overhang of the target sequence is filled in by Klenow fragment polymerization, as in FIG. 3, step 1, a blunt end adapter sequence is ligated to the target sequence. See, FIG. 3, step 2. Because a blunt end ligation is used rather than an overhang, a mixture of hybridizable sequences is unnecessary, and a single adapter sequence is used. Further, this method avoids any hybridization between the overhangs in the mixture of adapter sequences.

[0051] Using this method, the polymerized target sequence will be phosphorylated on only the 5′ strand. Further, as the adapter will have only 3′ and 5′ hydroxyls for ligation, only the 3′ end of the adapter will be ligated to the blunt, phosphorylated 5′ end of the target sequence, leaving a gap in the other strand. The unligated strand of the adapter sequence may then be melted off and the remaining polynucleotide again treated with DNA polymerase, e.g., Klenow or E. coli DNA polymerase, as shown in FIG. 3, step 3, resulting in a double-stranded, heterologous polynucleotide. This polynucleotide has the ambiguous nucleotide sequence sandwiched between the first Type-IIs endonuclease recognition site (“site A”), and the second, ligated Type-IIs recognition site (“site B”). One skilled in the art will recognize that approximately half of the adapter sequences will ligate to the target sequence in an inverted orientation. However, this does not affect the results of the methods of the present invention due to the inability of the second type-IIs enzyme to cleave the target sequence in those cases where the adapter is inverted. This is discussed in greater detail, below.

[0052] The polynucleotide resulting from ligation of this first adapter sequence to the target sequence is then treated with a second Type-IIs endonuclease specific for the ligated recognition site B. This second endonuclease treatment cleaves the remainder of the original polynucleotide from the target sequence. In preferred aspects, the second type-IIs endonuclease will be selected, or the second recognition site will be positioned within the adapter sequence, whereby the cleavage pattern of the second Type-IIs endonuclease results in the second cleavage substantially or entirely overlapping the first recognition site A, i.e., the cleavage of each strand is within or adjacent to the first recognition site (site A). FIG. 2, step 3, and FIG. 3, step 4 show the cleavage of the polynucleotide using HgaI (the HgaI recognition site is shown by the bar). Where the adapter sequence is ligated in a reverse orientation, as previously noted, no cleavage will occur within the first recognition site, as the recognition site will be at the distal end of the adapter sequence. Further, any primer sequences present within this adapter will be inverted, preventing subsequent amplification. By selecting a second Type-IIs endonuclease different from the first, recleavage of the first cleavage site is avoided. Selection of an appropriate type-IIs endonuclease for the second cleavage, and thus, the appropriate recognition site for the first adapter sequence, may often depend upon the first endonuclease used, or as described above, the position of the recognition site within the adapter. In preferred aspects, the first and second type-IIs endonucleases are selected whereby the second endonuclease cleaves entirely within the first endonuclease's recognition sequence. Examples of Type-IIs endonucleases for the second cleavage generally include those described above, and are typically selected from HgaI, BbvI, BspMI, BsmFI and FokI. Particularly preferred combinations of Type-IIs endonucleases for the first and second cleavages, as well as their cleavage patterns, are shown in FIG. 1. Continuing with the previous example, HgaI cleavage of the sample target sequence would produce the following sequence having the ambiguous base pairs captured by the first adapter sequence:

-C-T-C-T-T-C-N-N-N-N-G-C-G-T-C-
          -G-n-n-n-n-C-G-C-A-G

[0053] Depending upon the type-IIs endonucleases used in each step, the sequence of the overhang is known. For example, in the above example, the HgaI cleavage site for the second endonuclease is within the first endonuclease's recognition site, e.g., the EarI site. An example of a known overhang sequence is demonstrated in FIGS. 2 and 3, steps 4 and 5, respectively.

[0054] As noted, in the preferred aspects the second cleavage site substantially or entirely overlaps the first recognition site A. Accordingly, the number of possible hybridizing sequences for this ligation step is rendered unique. The specific recognition site A of the first Type-IIs endonuclease is known. Thus, where the second cleavage occurs entirely within the first recognition site A, only the unique sequences hybridizing to that sequence would be used. On the other hand, where the second cleavage occurs to some extent outside of the first recognition site A, a mixture of specific adapter sequences hybridizable to all possible permutations of nucleotides outside of site A is used. For example, where cleavage incorporates one nucleotide outside of the first recognition site, four variations of the known sequence are possible and a mixture of adapter sequences hybridizable to all four is used (See, e.g., MnlI-HgaI enzyme pairing in FIG. 1). The number of bases included in the second cleavage which fall outside the first recognition site is readily determinable from the endonucleases used.

[0055] As with the first adapter sequence, the second adapter sequence may comprise a PCR primer sequence and/or a promoter sequence for in vitro transcription.

[0056] The resulting construct will thus have the target sequence, specifically an ambiguous sequence attached to a portion or all of the first recognition site, sandwiched or captured between the two adapter sequences. For example, the resulting target sequence will generally have the sequence:

[0057] (Adapter sequence/A)-(Ambiguous sequence)-(B/Adapter sequence)

[0058] where A is a portion or all of the recognition site for the first Type-IIs endonuclease, and B is the recognition site for the second Type-IIs endonuclease. Again, applying the previous example, the resulting target sequence would appear as follows:

Adapter 2 -C-T-C-T-T-C-N-N-N-N-G-C-G-T-C- Adapter 1
Adapter 2′-G-A-G-A-A-G-n-n-n-n-C-G-C-A-G- Adapter 1′

[0059] The sequence -C-T-C-T-T-C-N-N-N-N- is captured from the original target sequence and sandwiched between the two adapter sequences.
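The capture described above can be illustrated in silico. The following is a minimal sketch only. It assumes, as in the running example, that the first recognition site is C-T-C-T-T-C (EarI) and that the four bases immediately 3′ of the site on either strand are the ambiguous bases captured. The function names and the toy input sequence are hypothetical and form no part of the described method.

```python
SITE = "CTCTTC"  # first recognition site A (EarI), taken from the example above
K = 4            # number of ambiguous bases captured (assumed, per the example)

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def captured_sequences(seq, site=SITE, k=K):
    """Return the set of k-mers lying immediately 3' of each site on either strand."""
    found = set()
    top = seq.upper()
    for strand in (top, reverse_complement(top)):
        start = strand.find(site)
        while start != -1:
            tail = strand[start + len(site): start + len(site) + k]
            if len(tail) == k:  # skip sites too close to the fragment end
                found.add(tail)
            start = strand.find(site, start + 1)
    return found


# Hypothetical toy fragment with one site on each strand.
toy = "AAACTCTTCGATTGGGGAAGAGTTT"
print(sorted(captured_sequences(toy)))
```

Each returned four-base string corresponds to one N-N-N-N segment that would sit between the two adapters.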

[0060] Prior to probing, the target sequence will generally be amplified to increase the detectability of the sequence. Amplification is generally carried out by methods well known in the art. See FIGS. 2 and 3, steps 5 and 6, respectively. For example, amplification may be performed by way of polymerase chain reaction (PCR) using methods generally well known in the art. See, e.g., Recombinant DNA Methodology, Wu, et al., ed., Academic Press (1989); Sambrook, et al., Molecular Cloning: A Laboratory Manual (2nd ed.), vols. 1-3, Cold Spring Harbor Laboratory (1989); Current Protocols in Molecular Biology, F. Ausubel, et al., ed., Greene Publishing and Wiley Interscience, New York (1987 and periodic updates). As described earlier, this amplification may be facilitated by the incorporation of specific primer sequences or complements within the adapters. Further, such amplification may also incorporate a label into the amplified target sequence. In a preferred embodiment, the target sequence may be amplified using an asymmetric PCR method whereby only the strand comprising the appropriate recognition site A is amplified. Asymmetric amplification is generally carried out by use of a primer which will initiate amplification of the appropriate strand of the target sequence, i.e., the strand comprising recognition site A.

[0061] The amplified target sequence may then be probed using specific oligonucleotide probes capable of hybridizing to the (A)-(ambiguous sequence)-(B) target sequence. As both the A and B sequences are set by the capturing method and are known, the probes need only differ with respect to the ambiguous portion of the sequence to be probed. For example, using the example sequence provided above, assuming that one is probing with the top strand, e.g., the bottom strand was amplified by appropriate selection of primers, etc., the probes would generally have the sequence C-T-C-T-T-C-n-n-n-n-G-C-G-T-C, where n denotes every possible base at the particular position, e.g., A, T, G, C. The preparation of oligonucleotide probes is performed by methods generally known in the art. See Gait, Oligonucleotide Synthesis: A Practical Approach, IRL Press (1990). Additionally, these oligonucleotide probes may be labelled, e.g., fluorescently or radioactively, so that probes which hybridize with target sequences can be detected. In preferred aspects, however, the probes will be immobilized, and it will be the target that is labelled. Labelling of the target sequence may be carried out using known methods. For example, amplification of the target sequence can incorporate a label into the amplified target sequence, e.g., by use of a labelled PCR primer or by incorporating a label during in vitro transcription of either strand.

[0062] In the preferred embodiment of the present invention, the target sequence is probed using an oligonucleotide array. Through the use of these oligonucleotide arrays, the specific hybridization of a target sequence can be tested against a large number of individual probes in a single reaction. Such oligonucleotide arrays employ a substrate comprising positionally distinct, sequence specific recognition reagents, such as polynucleotides, localized at high densities. A single array can comprise a large number of individual probe sequences. Further, because the probes are in known positionally distinct orientations on the substrate, one need only examine the hybridization pattern of a target oligonucleotide on the substrate to determine the sequence of the target oligonucleotide. Use and preparation of these arrays for oligonucleotide probing is generally described in PCT patent publication Nos. WO 92/10092 and WO 90/15070, and U.S. patent application Ser. Nos. 08/143,312 and 08/284,064. Each of these references is hereby incorporated by reference in its entirety for all purposes.

[0063] As noted, the target sequence will have the general sequence:

[0064] (Adapter sequence/A)-(N_(k))-(B/Adapter sequence)

[0065] where N_(k) denotes the ambiguous sequence of nucleotides of length k, the nucleotide sequence of each adapter sequence is known, and the sequences of sites A and B are known. Only the nucleotide sequence of the ambiguous portion of the target sequence, N_(k), is not known. Thus, the number of probes required on the array substrate is generally related to the number of ambiguous nucleotides in the target sequence. In one embodiment, the number of potential sequences for an ambiguous sequence is 4^(k), where k is the number of ambiguous bases within the sequence. For example, where there are four ambiguous nucleotides within the target sequence, the array would generally include about 4⁴ or 256 or more separate probes, where each probe will include the general sequence:

[0066] (A′)-(N′_(k))-(B′)

[0067] where “A′” and “B′” are the complements to site A and site B of the target sequence, respectively, and are constant throughout the array, and “N′_(k)” generally represents all potential sequences of the length of the ambiguous sequence of the target sequence. Thus, where the ambiguous sequence contains, e.g., 4 nucleotides, “N′_(k)” would typically include, for example, 4⁴ different sequences, at least one of which will hybridize with the target sequence. On an oligonucleotide array, this is accomplished through a simple combinatorial array like that shown in FIG. 4. Typically, as the size of the ambiguous sequence increases, the number of probes on the array will also increase, e.g., where the ambiguous sequence is 8 bases long, there will typically be about 4⁸ or 65,536 probes on the array.
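The size of the probe set follows directly from enumerating every possible N′_(k). A minimal sketch of that enumeration is given below, using the site sequences of the running example. The function names are hypothetical, and each probe is written position by position beneath the target strand (i.e., as the antiparallel strand read 3′ to 5′), which is an assumption about notation only.

```python
from itertools import product

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(seq):
    """Base-by-base complement, written in the same left-to-right order as the target."""
    return seq.translate(_COMPLEMENT)


def probe_set(site_a, site_b, k):
    """Enumerate all 4**k probes of the form A'-N'_k-B'."""
    a_c, b_c = complement(site_a), complement(site_b)
    return [a_c + "".join(n) + b_c for n in product("ACGT", repeat=k)]


probes = probe_set("CTCTTC", "GCGTC", 4)
print(len(probes))  # number of probes needed for four ambiguous bases
```

Only the middle k positions vary between probes, so the array grows by a factor of four for each additional ambiguous base.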

[0068] In the case of high molecular weight nucleic acids, the original polynucleotide sequence will generally comprise more than one and even several specific Type-IIs endonuclease recognition/cleavage sites, e.g., EarI sites. As a result, a number of ambiguous sequence segments will be captured for a given polynucleotide. Upon probing with an oligonucleotide array, the sequence will hybridize with a number of probes which are complementary to all of the captured sequences, producing a distinctive hybridization pattern for the given polynucleotide sequence. The specific hybridization pattern of the target sequence upon the array will generally indicate the ambiguous sequences adjacent to all of the cleavage sites as was described above.

[0069] III. Mapping Genomic Libraries

[0070] A. Physical Maps

[0071] A further embodiment of the present invention provides a method for the ordered mapping of genomic libraries. Typically, the term “genomic library” is defined as a set of sequence fragments from a larger polynucleotide fragment. Such larger fragments may be whole chromosomes, subsets thereof, plasmids, or other similar large polynucleotides. Specifically, the methods of the present invention are useful for mapping high molecular weight polynucleotides including chromosomal fragments, cosmids and Yeast Artificial Chromosomes (YACs).

[0072] Mapping techniques typically involve the identification of specific genetic markers on individual polynucleotide fragments from a genomic library. Comparison of the presence and relative position of specific markers on fragments generated by different cleavage patterns allows for the assembly of a contiguous genomic map, or “contig”.

[0073] Accordingly, in a particularly preferred aspect of the present invention, methods of genomic mapping are provided utilizing the sequence capturing methods already described. In particular, the methods of the present invention comprise identifying the Type-IIs sites and adjacent sequences (target sequences) on the individual fragments of a genomic library using the methods described above. FIG. 6 shows a genomic map for a portion of a yeast chromosome library, showing the overlap between the various fragments of the library.

[0074] The individual fragments of the library are treated using the above methods to capture the Type-IIs restriction sites and their adjacent ambiguous sequences. These captured sequences are then used as genetic markers, as described above, and a contig of the particular library may be assembled. In the preferred aspects, the captured Type-IIs sites and adjacent sequences will be hybridized to specific positionally oriented probes on the array. By determining which probe sequences hybridize with the captured sequences, these captured sequences are thereby determined.

[0075] The combination of these mapping techniques with oligonucleotide arrays provides the capability of identifying a large number of genetic markers on a particular sequence. Typically, a genomic fragment will have more than one, and even several, Type-IIs restriction sites within its sequence. Thus, when probed with an oligonucleotide array, the captured sequences from a particular genomic fragment will hybridize with a number of probes on the array, producing a distinctive hybridization pattern. Each hybridization pattern will generally comprise hybridization signals which correspond to each of the captured sequence markers in the fragment.

[0076] When repeated on separate fragments from the library, each fragment will generally produce a distinctive hybridization pattern, which reflects the sequences captured using the specific Type-IIs capture method. These hybridization patterns may be compared with hybridization patterns from differentially generated fragments. Where a specific marker is present in both fragments, it is an indication of potential overlap between the fragments. Two fragments that share several of the same Type-IIs sequences, e.g., overlapping fragments, will show similar hybridization patterns on the oligonucleotide array.

[0077] The greater the similarity or correlation between two fragments, the higher the probability that these fragments share an overlapping sequence. By correlating the hybridization pattern of each fragment in the library against each other fragment in the library, a single contiguous map of the particular library can be constructed.

[0078] In practice, each fragment is correlated to each other fragment, and a correlation score is given based upon the number of probes which cross-hybridize with the Type-IIs sites and adjacent sequences of both the first and second fragment. High scores indicate high overlap. For example, a perfect overlap, i.e., the comparison of two identical sequences, would produce a correlation score of 1. Similarly, sequences sharing no overlapping sequence would, ideally, produce a correlation score of 0. However, in practice, sequences that do not overlap will generally have correlation scores above zero, due to potential non-specific hybridizations, e.g., single base mismatches, background hybridization, and duplicated sequences, which may provide some baseline correlations between otherwise unrelated fragments. As a result, a cutoff may be established below which correlation scores are not used. The precise cutoff may vary depending upon the level of nonspecific hybridizations for the particular application. For example, by using capture methods that cut less frequently, and/or capture a greater number of sequences, the potential for duplicated markers is substantially reduced, and the cutoff may be lower. Correlation scores among all of the fragments may then be extrapolated to provide approximate percent overlap among the various fragments, and from this data, a contiguous map of the genomic library can be assembled (FIG. 7A). Additionally, one of skill in the art will appreciate that a more stringent determination of cosmid overlap may be obtained by repeating the capture and correlation methods using a different enzyme system, thereby generating additional, different markers and overlap data.
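The pairwise correlation described above may be sketched as follows. The text does not give a formula for the correlation score, so this sketch assumes one possible choice: the number of shared markers divided by the number of distinct markers in either fragment (a Jaccard index), which yields 1 for identical marker sets and 0 for disjoint sets. The function names, fragment labels and cutoff value are likewise hypothetical.

```python
def correlation_score(markers_a, markers_b):
    """Shared markers over all distinct markers (assumed scoring, 0 to 1)."""
    a, b = set(markers_a), set(markers_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def overlapping_pairs(fragments, cutoff):
    """Score every pair of fragments and keep those scoring above the cutoff.

    fragments maps a fragment label to the set of captured markers
    detected for that fragment on the array.
    """
    names = sorted(fragments)
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            score = correlation_score(fragments[first], fragments[second])
            if score > cutoff:
                pairs.append((first, second, score))
    return pairs


# Hypothetical marker sets for three cosmids.
library = {
    "c1": {"AAAA", "CCCC", "GGGG"},
    "c2": {"CCCC", "GGGG", "TTTT"},
    "c3": {"ACGT"},
}
print(overlapping_pairs(library, cutoff=0.2))
```

The pairs that survive the cutoff are the candidate overlaps from which a contig would then be ordered.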

[0079] The combined use of sequence based markers and oligonucleotide arrays, as described herein, provides a method for rapidly identifying a large number of genetic markers and mapping very large nucleic acid sequences, including, e.g., cosmids, chromosome fragments, YACs and the like.

[0080] The present invention also provides methods for diagnosing a genetic disorder wherein said disorder is characterized by a mutation in a sequence adjacent to a known Type-IIs endonuclease restriction site using the methods described above. Specifically, sequences adjacent to Type-IIs restriction sites are captured and their sequence is determined according to the methods described above. The determined sequence is then compared to a “normal” sequence to identify mutations.

[0081] B. Genetic Linkage Mapping

[0082] Genetic linkage markers are defined as highly polymorphic sequences which are uniformly distributed throughout a genome. In an additional embodiment, the methods of the present invention are used to identify and define these polymorphic markers. Because these markers are identified and defined by their proximity to Type-IIs restriction sites, they are referred to herein as restriction site sequence polymorphisms (“RSSPs”). In general, these RSSP markers are identified by comparing captured sequences between two genomes. The methods of the present invention may generally be used to identify these RSSPs in a number of ways. First, a polymorphism within the recognition site of the Type-IIs endonuclease will result in the presence of a captured sequence in one genome where it is absent in the other. This is generally the result where the polymorphism lies within the Type-IIs recognition site, thereby eliminating the recognition site in the particular sequence, and, as a result, the ability to capture the adjacent sequences. It will be appreciated that the inverse is also true, that a polymorphism may account for the presence of a recognition site where one does not exist in the wild type. Second, a polymorphism may be identified which lies within the captured ambiguous sequence. These polymorphisms will typically be detected as a sequence difference between the compared genomes.

[0083] A wide variety of polymorphic markers may be identified for any given genome, based upon the Type-IIs enzymes used for the first and second cleavages. For example, first cleavage enzymes which recognize distinct sequences will typically also define a number of distinct proximal polymorphisms.
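The two kinds of RSSP described above can be distinguished by a simple comparison, sketched below. The sketch assumes that each captured sequence has already been assigned to a locus so that corresponding sites in the two genomes can be paired. That assignment, the locus labels and the function name are hypothetical and are not provided by the capture method itself.

```python
def classify_rssps(captured_1, captured_2):
    """Compare captured sequences from two genomes, locus by locus.

    Each argument maps a (hypothetical) locus label to the ambiguous
    sequence captured at that locus; a locus is absent from the mapping
    when no sequence was captured there.
    """
    result = {}
    for locus in sorted(set(captured_1) | set(captured_2)):
        seq_1, seq_2 = captured_1.get(locus), captured_2.get(locus)
        if seq_1 is None or seq_2 is None:
            # Captured in one genome only: recognition site gained or lost.
            result[locus] = "site polymorphism"
        elif seq_1 != seq_2:
            # Captured in both, but the ambiguous bases differ.
            result[locus] = "sequence polymorphism"
        else:
            result[locus] = "no polymorphism detected"
    return result


genome_1 = {"locus1": "GATT", "locus2": "CCCA", "locus3": "ACGT"}
genome_2 = {"locus1": "GATT", "locus2": "CCTA"}
print(classify_rssps(genome_1, genome_2))
```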

[0084] The above described methods may be further modified, for example, using methods similar to those reported by Nelson, et al., Nature Genetics (1993) 4:11-18. Nelson, et al. report the identification of polymorphic markers using a system of genetic mismatch scanning. In the method of Nelson, et al., the genomes to be compared, e.g., grandchild and grandparent genomes, are first digested with an endonuclease which produces a 3′ overhang, e.g., PstI. One of the two genomes is methylated at all GATC sites in the sequence (DAM+) while the other remains unmethylated (DAM−). The genomic fragments from each group are denatured, mixed with each other, and annealed, resulting in a mixture of homohybrids and heterohybrids. In the homohybrids, both strands will be either methylated or unmethylated, while in the heterohybrids, one strand will be methylated. The mixture is then treated with nucleases which will not cleave the hemimethylated nucleic acid duplexes, for example DpnI and MboI. Next, the mixture is treated with a series of mismatch repair enzymes, e.g., MutH, MutL and MutS, which introduce a single strand nick on the duplexes which possess single base mismatches. The mixture is then incubated with ExoIII, a 3′ to 5′ exonuclease which is specific for double stranded DNA, and which will degrade the previously digested homohybrids and the nicked strand of the mismatched heterohybrids, from the 3′ side. Purification of the full dsDNA is then carried out using methods known in the art, e.g., benzoylated naphthoylated DEAE cellulose at high salt concentrations, which will bind ssDNA but not dsDNA. As a result, only the full-length, unaltered (perfectly matched) heterohybrids are purified. The recovered dsDNA fragments which indicate “identity by descent” (or “i.b.d.”) are labelled and used to probe genomic DNA to identify sites of meiotic recombination.

[0085] An adaptation of the above method can be applied to the capture methods of the present invention. In particular, the methods of the present invention can be used to capture sequences in the region of polymorphisms in a particular polynucleotide sequence. FIGS. 8A, 8B and 8C show a schematic representation of the steps used in practicing one embodiment of this aspect of the present invention. Specifically, a subset of genomic DNA which is identified by the presence of a Type-IIs recognition site is amplified (FIG. 8A), DNA containing polymorphisms within the amplified subset is isolated (FIG. 8B), and the sequences adjacent to the Type-IIs recognition site in the isolated polymorphism-containing sequences are identified and characterized (FIG. 8C).

[0086] Initially, polynucleotides from different sources which are to be compared, e.g., grandparent-grandchild, etc., are treated identically in parallel systems. These polynucleotides are each cleaved with a first Type-IIs endonuclease, as is described in substantial detail above. In FIG. 8A, step (a), for example, this first cleavage is shown using BseRI. The specific Type-IIs enzyme used in this first cleavage may again vary depending upon the desired frequency of cleavage, the length of the target sequence, etc.

[0087] As previously described, a first adapter bearing a second Type-IIs endonuclease recognition site is ligated to the cleaved polynucleotides (FIG. 8A, step (b)). In the example of FIG. 8A, steps (a), (b) and (c), this recognition site is that of the Type-IIs endonuclease FokI. The polynucleotides are then cleaved with an endonuclease which will cleave upstream from the captured sequence and ligated first adapter, such as a Type II endonuclease, e.g., HaeIII (see FIG. 8A, step (d)). Typically, this second cleavage enzyme will be selected whereby it cleaves more frequently than the first Type-IIs enzyme. A second adapter sequence may then be ligated to this new cleavage site (FIG. 8A, step (e)). The entire sequence, including the two adapter sequences, is then typically amplified (FIG. 8A, step (f)). The amplification is facilitated in preferred aspects by incorporating a primer sequence within the adapter sequences.

[0088] The amplified polynucleotides from each source are isolated (FIG. 8B, step (g)). The polynucleotide from one source is then methylated (FIG. 8B, step (h)). Both the methylated polynucleotide from the first source and the unmethylated polynucleotide from the second source are mixed together, heated to denature duplex DNA, and reannealed (FIG. 8B, step (i)). This generally results in a mixture of hemimethylated heterohybrids having one strand from each source, homohybrids of unmethylated dsDNA and homohybrids of fully methylated dsDNA. At this point, unlike the method of Nelson, et al. (the DpnI and MboI additions are omitted), the mixture is treated with the mismatch repair enzymes, e.g., MutLSH, which will nick only hemimethylated, mismatched hybrids, leaving the homohybrids and perfectly matched heterohybrids untouched (FIG. 8B, step (j)). The nicked DNA is then digested, as in Nelson, et al., with an exonuclease, e.g., ExoIII (FIG. 8B, step (k)). The mixture will then contain dsDNA which is fully methylated, i.e., homohybrids of DNA from one source; dsDNA which is unmethylated, i.e., homohybrids of DNA from the other source; heterohybrids of dsDNA from both sources which are perfectly matched, i.e., contain no mismatches or polymorphisms; and ssDNA, i.e., the DNA which is left from the heterohybrid, mismatched or polymorphic dsDNA. This ssDNA reflects the polymorphism and may then be purified from the dsDNA using the methods described in Nelson, et al., e.g., purification over benzoylated naphthoylated DEAE cellulose in high salt (FIG. 8C, step (l)).
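The selection logic of steps (j) and (k) can be summarized as a small decision function. This is a schematic sketch only: it reduces each duplex to three yes/no properties and ignores reaction efficiency, and the function name and return labels are hypothetical.

```python
def duplex_fate(strand_a_methylated, strand_b_methylated, has_mismatch):
    """Predict the form in which a reannealed duplex is recovered.

    Only hemimethylated duplexes carrying a mismatch are nicked by the
    mismatch repair enzymes; the nicked strand is then degraded by the
    exonuclease, leaving single stranded DNA. Everything else stays
    double stranded.
    """
    hemimethylated = strand_a_methylated != strand_b_methylated
    if hemimethylated and has_mismatch:
        return "ssDNA"
    return "dsDNA"


# Heterohybrid spanning a polymorphism: recovered as ssDNA and purified.
print(duplex_fate(True, False, True))
# Perfectly matched heterohybrid: remains dsDNA.
print(duplex_fate(True, False, False))
```

Only the first class is retained by the benzoylated naphthoylated DEAE cellulose step, which is why the recovered material is enriched for polymorphism-containing sequences.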

[0089] The purified single stranded DNA is then reamplified to dsDNA using methods well known in the art, e.g., PCR (FIG. 8C, step (m)). The amplified DNA may then be cleaved with a second Type-IIs endonuclease which recognizes the site incorporated into the first adapter sequence, as described above (FIG. 8C, step (n)), followed by ligation of another adapter sequence to the cleavage end (FIG. 8C, step (o)). The captured sequence thus identifies a polymorphism which lies between the captured sequence and the upstream cleavage site. The captured sequence may then be determined according to the methods described herein, e.g., amplification, labelling and probing (FIG. 8C, step (p)).

[0090] IV. Applications

[0091] The methods described herein are useful in a variety of applications. For example, as is described above, these methods can be used to generate ordered physical maps of genomic libraries, as well as genetic linkage maps which can be used in the study of genomes of varying sources. The mapping of these genomes allows further study and manipulation of the genome in diagnostic and therapeutic applications, e.g., gene therapy, diagnosis of genetic predispositions for particular disorders and the like.

[0092] In addition to pure mapping applications, the methods of the present invention may also be used in other applications. In a preferred embodiment, the methods described herein are used in the identification of the source of a particular sample. This application would include forensic analysis to determine the origin of a particular tissue sample, such as analyzing blood or other evidence in criminal investigations, paternity investigations, etc. Additionally, these methods can also be used in other identification applications, for example, taxonomic study of plants, animals, bacteria, fungi, viruses, etc. This taxonomic study includes determination of the particular identity of the species from which a sample is derived, or the interrelatedness of samples from two separate species.

[0093] The various identification applications typically involve the capturing and identification of sequences adjacent to specific Type-IIs restriction sites in a sample to be analyzed, according to the methods already described. These sequences are then compared to sequences identically captured and identified from a known source. Where sequences captured from both the sample and the source are identical or highly similar, it is indicative that the sample was derived from the source. Where the sequences captured from the sample and known source share a large number of identical sequences, it is indicative that the sample is related to the known source. However, where the sample and source share few like sequences, it is indicative of a low probability of interrelation.

[0094] Precise levels of interrelation to establish a connection between source and sample, i.e., captured sequence homology, will typically be established based upon the interrelation which is being proved or disproved, the identity of the known source, the precise method used, and the like. Establishing the level of interrelation is well within the ordinary skill in the art. For example, in criminal investigations, a higher level of homology between sample and known source sequences will likely be required to establish the identity of the sample in question. Typically, in the criminal context, interrelation will be shown where there is greater than 95% captured sequence homology, preferably greater than 99% captured sequence homology, and more preferably, greater than 99.9% captured sequence homology. For other identification applications, interrelation between sample and known source may be established by a showing of, e.g., greater than 50% captured sequence homology, and typically greater than 75% captured sequence homology, preferably greater than 90% captured sequence homology, and more preferably greater than 95 to 99% captured sequence homology.

[0095] The level of interrelation will also typically vary depending upon the portion of a genome or nucleic acid sequence which is used for comparison. For example, in attempting to identify a sample as being derived from one member of a species as opposed to another member of the same species, it will generally be desirable to capture sequences in a region of the species' genetic material which displays a lower level of homology among the various members of the same species. This results in a higher probability of the captured sequences being specific to one member of the species. The opposite can be true for taxonomic studies, i.e., to identify the genus and species of the sample. For example, it may generally be desirable to select a portion of the genetic material of the known genus or species which is highly conserved among members of the genus and/or species, thereby permitting identification of the particular sample to that genus or species.

[0096] The present invention is further illustrated by the following examples. These examples are merely to illustrate aspects of the present invention and are not intended as limitations of this invention. The methods used generally employ commercially available reagents or reagents otherwise known in the art.

EXAMPLES

Example 1

[0097] 1. Digesting High Molecular Weight DNA with EarI

[0098] 4 μg of λ DNA was treated with 4 units of EarI in 10 μl at 37° C. for 4 hours. The reaction was then heated to 70° C. for 10 minutes. Complete cleavage was verified by running 5 μl of the sample on an agarose gel. The remaining 5 μl was brought to 40 μl (final concentration of 50 ng/μl λ DNA).

[0099] 2. Klenow Fill-In Reaction

[0100] 4 μl of the digested λ DNA was added to 0.5 μl of 10× Klenow buffer, 0.5 μl 2 mM dNTPs, and 0.05 μl of Klenow fragment (0.25 units). The reaction mixture was incubated for 20 minutes at 25° C., followed by 10 minutes at 75° C. Similar results were also obtained using T4 DNA polymerase for the fill-in reaction.

[0101] 3. Preparing Adapter Sequences

[0102] Two separate adapter sequences were prepared, adapter sequence 1 and adapter sequence 2. Adapter sequence 1 is used in the first ligation reaction whereas adapter sequence 2 is used for the second. As each adapter and its ligation are somewhat different, they are addressed separately.

[0103] Double stranded adapter 1, comprising the second Type-IIs endonuclease restriction site 3′C-G-C-A-G- . . . 5′ and a T7 promoter sequence, was prepared by adding 10 μl each of 10 μM unphosphorylated T7 strand and its complement, heating the mixture to 95° C., then cooling over 20 minutes to anneal the strands. The strands were prepared using DNA synthesis methods generally well known in the art. The resulting mixture had a final dsDNA adapter concentration of 5 μM.

[0104] Adapter 2, comprising the overhang complementary to that created by the HgaI digestion of the target sequence as well as a T3 promoter sequence, was prepared by first creating the overhang region. A single stranded oligonucleotide of the sequence 3′ . . . -G-A-G-A-A 5′ was synthesized on a single stranded T3 promoter sequence. The 5′ end of this sequence was then phosphorylated as follows (the final concentration of each reagent is shown in parentheses): 10 μl of 10 μM oligonucleotide (5 μM), 2 μl of 10× kinase buffer (1×), 2 μl 10 mM ATP (1 mM), 5 μl water and 1 μl T4 polynucleotide kinase (10 units) were combined. The reaction was incubated at 37° C. for 60 minutes, then at 68° C. for 10 minutes and cooled.

[0105] To the T3/overhang ssDNA strand was added 10 μl of 10 μM appropriate antistrand and 3.33 μl of buffer. This mixture was heated to 95° C. and cooled over 20 minutes to anneal the two strands.

[0106] 4. Ligation of First Adapter to Target Sequence

[0107] At least a 50:1 molar ratio of first adapter to cleavage ends was desired and an approximate ratio of 100:1 adapters to cleavage ends was targeted. As λ DNA digested with EarI is known to result in 34 pairs of cleavage ends, a 3400:1 mole ratio of adapters to λ DNA was used.

[0108] In 11 μl total reaction mixture, the following were combined: 5 μl from the fill-in reaction (approx. 40 nmoles target DNA), 4 μl of 5 μM first adapter (2 μM final concentration), 1.1 μl 10× ligation buffer (1× final concentration), and 1 μl of T4 DNA ligase (400 units).

[0109] The reaction was incubated at 25° C. for 2 hours, then incubated at 75° C. for 10 minutes to inactivate the ligase as well as dissociate the unligated adapter strand.
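The adapter excess can be checked with a short calculation. The sketch below takes the volumes and concentrations from the text (4 μl of digest at 50 ng/μl carried through the fill-in reaction, and 4 μl of 5 μM first adapter). The λ genome length of 48,502 bp and the average mass of 650 g/mol per base pair are assumptions introduced here for the estimate and do not appear in the example.

```python
LAMBDA_BP = 48_502         # assumed length of the lambda genome, in base pairs
G_PER_MOL_PER_BP = 650.0   # assumed average molar mass of one dsDNA base pair
END_PAIRS = 34             # pairs of EarI cleavage ends in lambda DNA (from the text)


def moles_dsdna(mass_g, length_bp):
    """Convert a mass of double stranded DNA to moles of molecules."""
    return mass_g / (length_bp * G_PER_MOL_PER_BP)


lambda_mol = moles_dsdna(4 * 50e-9, LAMBDA_BP)  # 4 ul of digest at 50 ng/ul
adapter_mol = 4e-6 * 5e-6                       # 4 ul of 5 uM first adapter

ratio_per_molecule = adapter_mol / lambda_mol
ratio_per_end_pair = ratio_per_molecule / END_PAIRS

print(f"adapters per lambda molecule: {ratio_per_molecule:.0f}")
print(f"adapters per pair of cleavage ends: {ratio_per_end_pair:.0f}")
```

Under these assumptions the estimate falls somewhat short of the targeted 100:1 per pair of ends but remains above the stated 50:1 minimum.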

[0110] 5. Second Klenow Fill-In Reaction

[0111] Filling in the single stranded portion of the target sequence/first adapter created by dissociation of the unligated strand in step 4 above was accomplished using the Klenow fragment DNA polymerase.

[0112] In 14 μl total was added 11 μl of DNA to which the first adapter had been ligated (approx. 34.4 nM total adapted ends), 1.5 μl 10× Klenow buffer (1×), 1.5 μl 2 mM dNTPs (50 μM each dNTP) and 0.05 μl Klenow fragment (0.25 units). This mixture was incubated at 37° C. for 30 minutes, then heated to 75° C. for 10 minutes. Again, similar results were obtained using E. coli DNA polymerase.

[0113] 6. Second Digestion with HgaI

[0114] To the 14 μl reaction mixture of step 5 was added 1 μl of HgaI (2 units). The reaction was incubated at 25° C. for 3 hours. 1.6 μl of 5 M NaCl (0.5 M) was then added to raise the melting point of the target sequence to above 70° C. The reaction mixture was then heated to 65° C. for 20 minutes.

[0115] 7. Ligation of Second Adapter to Target Sequence

[0116] The 16 μl reaction mixture from step 6 is expected to have an approximate concentration of 4.4 nM target sequence with compatible ends for the second ligation. This number is halved from the expected concentration of total target sequence, to account for the blunt end ligation of adapter 1 in the reverse orientation such that HgaI cleavage would not occur.

[0117] To the 16 μl reaction mixture from step 6 was added 5 μl of 3 μM second adapter prepared in step 3, above (0.3 μM), 5 μl 10× ligation buffer (1×), 23.5 μl water and 0.5 μl T4 DNA ligase (200 units). The reaction mixture was incubated at 37° C. for 30 minutes then heated to 65° C. for 10 minutes.
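The final concentrations given in parentheses follow from the component volumes. A minimal sketch of that bookkeeping for the second ligation is shown below, using only the volumes and stock concentrations stated above; the helper name and dictionary keys are hypothetical.

```python
def final_concentration(stock, volume_ul, total_ul):
    """Concentration after dilution, in the same units as the stock."""
    return stock * volume_ul / total_ul


# Component volumes for the second ligation, in microliters.
volumes_ul = {
    "reaction mixture": 16.0,
    "second adapter": 5.0,
    "10x ligation buffer": 5.0,
    "water": 23.5,
    "T4 DNA ligase": 0.5,
}

total_ul = sum(volumes_ul.values())
adapter_um = final_concentration(3.0, volumes_ul["second adapter"], total_ul)
buffer_x = final_concentration(10.0, volumes_ul["10x ligation buffer"], total_ul)

print(f"total volume: {total_ul} ul")
print(f"second adapter: {adapter_um:.2f} uM")
print(f"ligation buffer: {buffer_x:.1f}x")
```

The same helper applies to the other reaction mixtures in this example, given their component volumes.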

[0118] 8. PCR Amplification

[0119] 5 μl of the captured target sequence from step 7 is used as the template for PCR amplification (approx. 440 μM total; 14.7 μM each end). To this was added 1.25 μl each of 10 μM T7 primer and 10 μM T3 primer (0.25 μM primer), 5 μl 10× PCR buffer (1×), 5 μl 4× 2 mM dNTPs (200 μM each dNTP), 24.5 μl water and 0.5 μl Taq polymerase (2.5 units).

[0120] PCR was carried out for 40 cycles of 94° C. for 30 seconds, 55° C. for 30 seconds and 72° C. for 30 seconds. Controls were run using water, λ DNA cut with EarI and uncut λ DNA subjected to steps 1-7. 2 μl from the reaction was run on a 4% NuSieve® Agarose gel, indicating a 62-bp amplicon which is carried into the next step.

[0121] 9. Labelling-Asymmetric PCR

[0122] The 62-bp amplicon produced in step 8 is next labeled with a 5′-Flabel by asymmetric PCR. 44 μl of the PCR amplicon from step 8 (50fmoles) is mixed with 5 μl of 10 μM T7-5° F. primer (1 μM primer), 2 μlof lox PCR buffer (1× buffer), 3 μl of 100 μM MgCl₂ (5 mM), 5 μl of 4× 2mM dNTPs (200 mM each dNTP) and 0.5 μl Taq polymerase (2.5 units).

[0123] PCR was carried out for 40 cycles as described in step 8.3 μlfrom this reaction was the run on 4% NuSieve® Agarose gel and comparedto the amplicon from step 8 to confirm florescent labelling.

[0124] 10. Results

[0125] The fluorescent captured sequence was heated to 95° C. briefly, then buffered with 6× SSPE, 10 mM CTAB and 0.2% Triton X-100. The captured sequence was then probed on an oligonucleotide array having the combinatorial array shown in FIG. 4. FIG. 5, panel A shows the expected hybridization pattern of λ DNA to the array of FIG. 4 as denoted by the blackened regions on the array. FIG. 5, panel B illustrates the actual hybridization pattern of captured Type-IIs sites from λ DNA on an array as shown in FIG. 4. The close correlation between expected and actual hybridization is evident.

Example 2

[0126] The above capture methods were applied to a genomic library of 12 known cosmids from yeast chromosome IV. The clones had previously been physically mapped using EcoRI-HindIII fragmentation. The specific library, including known map positions and overlap of the 12 cosmids, is illustrated in FIG. 6.

[0127] The twelve genomic clones were constructed in a pHC79 vector, in E. coli host HB101. Cosmid DNA was prepared from 3 ml cultures by an alkaline lysis miniprep method. The miniprep DNA was digested with EcoRI and HindIII to confirm the known fingerprint of the large cloned inserts. Cosmid DNA was treated with a linear-DNA-specific DNase, Plasmid-Safe™ DNase, at 37° C. for 15 minutes, followed by heat inactivation. The DNase treatment was carried out to remove any potential spurious EarI-digested sites resulting from contaminating bacterial DNA, while leaving the cosmid DNA substantially untouched. After confirming the presence of clean banding cosmid DNA, the resulting cosmids were then subjected to the capture methods described above. The pHC79 vector, without a yeast insert, was transformed into HB101 and isolated as a miniprep, to serve as a control.

[0128] The data from the array was normalized as follows. First, the probe array was normalized for background intensity by subtracting the background scan (hybridization buffer with no target). Second, the data was normalized to the specific vector used in producing the cosmids. Normalization to the vector had two parts: (i) the average intensity of four hybridizing markers present in the pHC79 vector was calculated for each scan, for use as an internal control in that scan, and all intensities in that scan were divided by this average; and (ii) the overall background intensity of the pHC79 vector in a bacterial host, absent a yeast insert, was subtracted. The array signal was then normalized for relative hybridization of the probes on the array, by using equimolar target mixtures for each probe. Finally, the four values corresponding to the pHC79 markers were discarded.
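The normalization sequence described above can be sketched in Python as follows. This is an illustrative sketch only: the function and argument names, the probe labels and the toy intensities are hypothetical, the vector-only control is assumed to have been scaled in the same way as the scan, and the correction for relative hybridization is assumed to be a simple division by a per-probe response factor.

```python
def normalize_scan(raw, background, vector_only, probe_response, vector_probes):
    """Normalize one array scan; each dict maps probe name -> value.

    raw            : intensities from the cosmid scan
    background     : scan of hybridization buffer with no target
    vector_only    : vector-only control (assumed already scaled like the scan)
    probe_response : relative signal of each probe with equimolar targets
    vector_probes  : names of the probes matching the four vector markers
    """
    # 1. Subtract the no-target background scan.
    signal = {p: raw[p] - background[p] for p in raw}
    # 2. Divide by the mean intensity of the vector markers (internal control).
    control = sum(signal[p] for p in vector_probes) / len(vector_probes)
    signal = {p: v / control for p, v in signal.items()}
    # 3. Subtract the vector-only signal.
    signal = {p: v - vector_only[p] for p, v in signal.items()}
    # 4. Correct for unequal probe hybridization (assumed to be a division).
    signal = {p: v / probe_response[p] for p, v in signal.items()}
    # 5. Discard the vector-marker probes themselves.
    return {p: v for p, v in signal.items() if p not in vector_probes}


# Toy data: four vector-marker probes (V1-V4) and two other probes (P1, P2).
vector_probes = ["V1", "V2", "V3", "V4"]
raw = {"V1": 110, "V2": 210, "V3": 310, "V4": 410, "P1": 510, "P2": 60}
background = {p: 10 for p in raw}
vector_only = {"V1": 0.4, "V2": 0.8, "V3": 1.2, "V4": 1.6, "P1": 0.0, "P2": 0.2}
probe_response = {"V1": 1.0, "V2": 1.0, "V3": 1.0, "V4": 1.0, "P1": 2.0, "P2": 1.0}

normalized = normalize_scan(raw, background, vector_only, probe_response, vector_probes)
print(normalized)
```

In this toy case, probe P2 carries only vector-level signal and so normalizes to zero, while P1 retains a positive value.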

[0129] The resulting hybridization patterns were then correlated, pair-wise, between all cosmids. Specifically, the signal intensity for each probe was compared with the same probe's intensity for all other fragments. Where the signals were the same, there was some correlation. The more signals that were the same, the higher the correlation score.
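One way to picture the pair-wise comparison is with a standard correlation coefficient computed over the probe intensities of two cosmids. The sketch below assumes a Pearson coefficient, which this description does not specify; the function names and the toy profiles are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length intensity profiles.

    Raises ZeroDivisionError if either profile is constant.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def pairwise_scores(profiles):
    """Score every unordered pair of cosmids; keys are sorted name pairs."""
    return {
        (a, b): pearson(profiles[a], profiles[b])
        for a, b in combinations(sorted(profiles), 2)
    }


# Toy profiles over four probes: c1 and c2 share markers, c3 shares none.
profiles = {
    "c1": [1.0, 0.0, 1.0, 0.0],
    "c2": [1.0, 0.0, 1.0, 0.0],
    "c3": [0.0, 1.0, 0.0, 1.0],
}
scores = pairwise_scores(profiles)
print(scores)
```

Identical profiles score at the top of the scale and complementary profiles at the bottom, which is the behavior the paragraph above describes in words.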

[0130] These correlation scores are plotted against the known percent overlap for these cosmids as determined from the EcoRI/HindIII physical map. This plot is shown in FIG. 7A. As is apparent, the hybridization correlation score between fragments correlates readily with the percent overlap of the fragments.

Example 3 Simulated Annealing

[0131] The correlation scores from yeast chromosome IV, above, were used to construct a best fitting contig, using the simulated annealing process as described by Cuticchia, et al., The use of simulated annealing in chromosome reconstruction experiments based on binary scoring, Genetics (1992) 132:591-601. A global maximum was sought for the sum of correlation coefficient scores for a given sequence of cosmids in the randomly constructed and permutated contig. The resulting high scoring contigs for all 12 cosmids and for the 10 “strong-signal” cosmids are shown below. Each cosmid was assigned a rank based upon the known position of that cosmid, and these are as follows:

TABLE 1

Cosmid Number    Cosmid Rank
9371             A
8552             B
8087             1
9481             2
9858             3
9583             4
8024             5
8253             6
9509             7
9460             8
8064             9
9831             10
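The ordering search can be illustrated with a generic simulated annealing loop that seeks the clone order maximizing the sum of scores between neighbouring clones. The sketch below is not the procedure of Cuticchia et al.; the segment-reversal move, the geometric cooling schedule, the parameter values and all names are assumptions chosen for illustration, and the toy scores simply reward clones of adjacent rank.

```python
import math
import random
from itertools import combinations


def contig_score(order, score):
    """Sum of pairwise scores between neighbouring clones in the ordering."""
    return sum(score[frozenset(pair)] for pair in zip(order, order[1:]))


def anneal(clones, score, steps=20000, t_start=1.0, t_end=0.001, seed=0):
    """Search for a high-scoring ordering; returns (best_order, best_score)."""
    rng = random.Random(seed)
    order = list(clones)
    rng.shuffle(order)  # start from a randomly constructed contig
    current = contig_score(order, score)
    best_order, best = order[:], current
    for step in range(steps):
        # Geometric cooling from t_start towards t_end (an assumed schedule).
        temp = t_start * (t_end / t_start) ** (step / steps)
        # Candidate move: reverse a randomly chosen segment of the ordering.
        i, j = sorted(rng.sample(range(len(order)), 2))
        candidate = order[:i] + order[i:j + 1][::-1] + order[j + 1:]
        cand = contig_score(candidate, score)
        # Metropolis rule: keep improvements, occasionally keep losses.
        if cand >= current or rng.random() < math.exp((cand - current) / temp):
            order, current = candidate, cand
            if current > best:
                best_order, best = order[:], current
    return best_order, best


# Toy scores: 1.0 for clones of adjacent rank, 0.0 otherwise.
clones = [1, 2, 3, 4, 5]
score = {
    frozenset((a, b)): 1.0 if abs(a - b) == 1 else 0.0
    for a, b in combinations(clones, 2)
}
best_order, best = anneal(clones, score)
print(best_order, best)
```

Because the search is stochastic, the result depends on the seed and the schedule; an ordering and its reverse score identically, so orientation is not determined by the score alone.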

[0132] Simulated annealing of all twelve cosmids produced the following ordering:

[0133] (1 2 3 4) (7 6 5) A B (8 9 10)

[0134] Inclusion of the weaker signal cosmids, A and B, results in some shuffling of the predicted order of the cosmids. Removal of cosmids A and B, the “weak-signal” cosmids, produced the following ordered map of the remaining ten cosmids:

[0135] (1 2 3 4 5 6 7) (8 9 10)

[0136] which reflects the proper ordering and indicates the existence of the two “islands” of cosmids as seen in the physical map.

[0137] As can be seen, the inclusion of the weaker signal cosmids A and B, 8552 and 9731, inverts the order of clones in the center positions (5, 6 and 7), and improperly places

Example 4 Simulated Mapping of Yeast Chromosome III

[0138] To determine how well the distribution of points in FIG. 7A matches the distribution of scores expected for a random set of yeast cosmids, a random set of fifty 35 to 40 kb sequences from yeast chromosome III (“YCIII”) was simulated. A list of perfect matches corresponding to EarI-associated tetramers was also generated. Due to the difficulty in assigning simulated intensity scores for these markers, marker probes were scored as 1 and non-marker probes as 0. Inner product scores were used instead of correlation coefficients to determine the similarity of the marker sets in 1225 comparisons of the fifty simulated YCIII cosmids. The scores were plotted against expected overlap, and this is shown in FIG. 7B. Even when perfect information regarding marker identities in the tetramer sets is compared, a certain amount of scatter is seen in the plot. Additionally, comparison of sequences with no overlap generates inner product scores ranging from 0.05 to 0.4. These two features are characteristic of the actual data shown in FIG. 7A.
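The binary scoring used in the simulation can be sketched as follows: each simulated cosmid becomes a vector of 256 entries, one per possible tetramer, with 1 for a marker and 0 otherwise, and pairs of vectors are compared by inner product. Scaling the inner product by the vector lengths is an assumption made here so that scores fall between 0 and 1; the description does not state how the scores were scaled. Function names and the two toy marker sets are hypothetical.

```python
from itertools import product
from math import comb, sqrt

# All 256 possible tetramer probes, in a fixed order.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def marker_vector(markers):
    """Binary vector: 1 for a marker tetramer, 0 for a non-marker tetramer."""
    present = set(markers)
    return [1 if t in present else 0 for t in TETRAMERS]


def inner_product_score(u, v):
    """Inner product scaled by the vector lengths (scaling is an assumption)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Two toy marker sets sharing two of their four tetramers.
cosmid_a = marker_vector(["ACGT", "TTAA", "GGCC", "CATG"])
cosmid_b = marker_vector(["ACGT", "TTAA", "AAAA", "CCCC"])
example_score = inner_product_score(cosmid_a, cosmid_b)

# Fifty cosmids give 50-choose-2 unordered pairs.
n_comparisons = comb(50, 2)
print(example_score, n_comparisons)
```

The pair count for fifty cosmids matches the 1225 comparisons stated above. Unrelated sequences can still share tetramers by chance, which is consistent with the nonzero scores reported for sequences with no overlap.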

[0139] The simulation was repeated using BbsI and HphI as the first cleaving enzyme, and the results are shown in FIGS. 7C and 7D, respectively. From this data, it can be seen that the amount of scatter in a particular plot is a function of the inverse of the frequency of cleavage sites (e.g., number of markers) in the target sequence. In particular, using HphI as the first cleaving enzyme would produce 564 markers in YCIII, whereas BbsI would yield 212 and EarI would yield 274. The scatter for the more frequently cutting HphI enzyme is substantially less than that for BbsI and EarI. Additionally, as noted previously, the Y intercept is also affected by the number of markers in the target sequence, as well as the frequency of a particular marker (e.g., marker duplication). Both of these factors may be influenced by the choice of capture methods and enzymes.
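The number of markers an enzyme yields depends on how often its recognition site occurs in the target. As a rough illustration, the Python sketch below counts recognition sites on both strands of a sequence. The recognition sequences used (EarI CTCTTC, BbsI GAAGAC, HphI GGTGA) are taken from standard enzyme catalogs rather than from this description, and the function names and the toy sequence are hypothetical.

```python
# Recognition sequences (5'->3') as listed in standard enzyme catalogs.
SITES = {"EarI": "CTCTTC", "BbsI": "GAAGAC", "HphI": "GGTGA"}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def count_sites(sequence, site):
    """Count occurrences of a recognition site on either strand."""
    sequence = sequence.upper()
    # A set avoids double counting if a site equals its reverse complement.
    targets = {site, reverse_complement(site)}
    return sum(
        sequence.startswith(t, i)
        for i in range(len(sequence))
        for t in targets
    )


# Toy sequence with one EarI site on each strand and one HphI site.
toy = "AACTCTTCGGGAAGAGTTGGTGA"
counts = {enzyme: count_sites(toy, site) for enzyme, site in SITES.items()}
print(counts)
```

A five-base site such as that of HphI is expected to occur more often in random sequence than a six-base site, which is in line with the higher marker count reported for HphI.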

[0140] The above description is illustrative and not restrictive. Many variations of the invention will become apparent to those of skill in the art upon review of this disclosure. The scope of the invention should, therefore, be determined not with reference to the above description, but instead should be determined with reference to the appended claims along with their full scope of equivalents. All publications and patent documents cited in this application are incorporated by reference in their entirety for all purposes to the same extent as if each individual publication or patent document were so individually denoted.

What is claimed is:
 1. A method of identifying sequences in a polynucleotide sequence, comprising: first cleaving the polynucleotide sequence with a first type-IIs endonuclease; first ligating a first adapter sequence to the polynucleotide sequence cleaved in said first cleaving step, said first adapter having a recognition site for a second type-IIs endonuclease; second cleaving the polynucleotide sequence resulting from said first ligating step, with the second type-IIs endonuclease; second ligating a second adapter sequence to the polynucleotide sequence cleaved in said second cleaving step; and determining the sequence of nucleotides of the polynucleotide sequence between the first and second adapter sequences.
 2. The method of claim 1, wherein: in said first cleaving step, the first type-IIs endonuclease is selected from the group consisting of BsmAI, EarI, MnlI, PleI, AlwI, BbsI, BsaI, BspMI, Esp3I, HgaI, SapI, SfaNI, BseRI, HphI and MboII; and in said second cleaving step, the second type-IIs endonuclease is selected from the group consisting of HgaI, BbvI, BspMI, BsmFI and FokI.
 3. The method of claim 2, wherein in said first cleaving step, the first type-IIs endonuclease is EarI; and in said second cleaving step, the second type-IIs endonuclease is HgaI.
 4. The method of claim 1, wherein in said first and second ligating steps, said first and second adapter sequences comprise primer sequences.
 5. The method of claim 4, wherein prior to said determining step, the sequence of oligonucleotides in the polynucleotide between the first and second adapter sequences is amplified.
 6. The method of claim 1, wherein in said determining step, the sequence of nucleotides between the first and second adapter sequences is determined by hybridization to an oligonucleotide probe.
 7. The method of claim 6, wherein said oligonucleotide probe is a positionally distinct probe on an oligonucleotide array, a position of the probe being indicative of the sequence of the probe.
 8. A method of generating an ordered map of a library of genomic fragments, the method comprising: identifying sequences in each of the genomic fragments in the library, according to the method of claim 1; comparing the sequences identified in each fragment with the sequences identified in each other fragment to obtain a level of correlation between each fragment and each other fragment; and ordering the fragments according to their level of correlation.
 9. A method of identifying polymorphisms in a target polynucleotide sequence, the method comprising: identifying sequences in a wild-type polynucleotide sequence, according to the method of claim 1, repeating said identifying step on the target polynucleotide sequence; and determining differences in the sequences identified in each of said identifying steps, the differences being indicative of a polymorphism.
 10. The method of claim 1, wherein said sequences in a polynucleotide sequence are proximal to a polymorphism.
 11. A method of identifying a source of a biological sample, the method comprising: identifying a plurality of sequences in a polynucleotide sequence derived from the sample, according to the method of claim 1; and comparing the plurality of sequences identified in said identifying step with a plurality of sequences identically identified from a polynucleotide derived from a known source, identity of the plurality of sequences identified from the sample with the plurality of sequences identified from the known source being indicative that the sample was derived from the known source.
 12. A method of determining a relative location of a target nucleotide sequence on a polynucleotide, the method comprising: generating an ordered map of the polynucleotide according to the method of claim 8; fragmenting the polynucleotide; determining which fragment includes the target nucleotide sequence; correlating a marker on the fragment with a marker on the ordered map to identify the approximate location of the target nucleotide sequence on the polynucleotide.